In [42]:
%matplotlib inline

In [2]:
import nltk

What is NLP?

Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this tutorial we’ll explore how to do that using Python, the Natural Language Toolkit (NLTK) and Gensim.

NLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.

Quick Overview of NLTK

NLTK was written by two eminent computational linguists, Steven Bird (Senior Research Associate of the LDC and professor at the University of Melbourne) and Ewan Klein (Professor of Linguistics at Edinburgh University). The NLTK library provides a combination of natural language corpora, lexical resources, and example grammars with language processing algorithms, methodologies and demonstrations for a very Pythonic "batteries included" view of natural language processing.

As such, NLTK is perfect for research-driven (hypothesis-driven) workflows for agile data science.

Installing NLTK

This notebook has a few dependencies, most of which can be installed via the Python package manager, pip.

  1. Python 2.7+ or 3.5+ (Anaconda is ok)
  2. NLTK
  3. The NLTK corpora
  4. The BeautifulSoup library
  5. The gensim library
  6. The readability-lxml library (used later for HTML text extraction)

Once you have Python and pip installed you can install NLTK from the terminal as follows:

~$ pip install nltk
~$ pip install matplotlib
~$ pip install beautifulsoup4
~$ pip install gensim
~$ pip install readability-lxml

Note that these will also install NumPy and SciPy if they aren't already installed.

What NLTK Includes

NLTK is a useful pedagogical resource for learning NLP with Python and serves as a starting place for producing production-grade code that requires natural language analysis. It is also important to understand what NLTK is not.

What NLTK is Not

  • Production ready out of the box
  • Lightweight
  • Generally applicable
  • Magic

NLTK provides a variety of tools that can be used to explore the linguistic domain, but it is not a lightweight dependency that can be easily included in other workflows, especially those that require unit and integration testing or other build processes. This stems from the fact that NLTK includes not only a great deal of code but also a rich and complete library of corpora that powers the built-in algorithms.

The Good Parts of NLTK

The Bad Parts of NLTK

  • Syntactic Parsing

    • No included grammar (not a black box)
    • No Feature/Dependency Parsing
    • No included feature grammar
  • The sem package

    • Toy only (lambda-calculus & first order logic)
  • Lots of extra stuff (heavyweight dependency)

    • papers, chat programs, alignments, etc.

Knowing the good and the bad parts will help you explore NLTK further - looking into the source code to extract the material you need, then moving that code to production. We will explore NLTK in more detail in the rest of this notebook.

Obtaining and Exploring the NLTK Corpora

NLTK ships with a variety of corpora; let's use a few of them to do some work. To download the NLTK corpora, open a Python interpreter:

import nltk
nltk.download()

This will open up a window with which you can download the various corpora and models to a specified location. For now, go ahead and download it all, as we will be exploring as much of NLTK as we can. Also take note of the download directory - you're going to want to know where that is so you can get a detailed look at the corpora that are included. I usually export an environment variable to track this. You can do this from your terminal:

~$ export NLTK_DATA=/path/to/nltk_data
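
If you prefer to skip the GUI (for example, on a remote server), nltk.download also accepts package identifiers directly. Here is a minimal sketch covering the corpora and models this notebook relies on:

import nltk

# Non-interactive downloads; these identifiers come from the NLTK data index.
for pkg in ['punkt', 'stopwords', 'gutenberg', 'shakespeare', 'brown',
            'reuters', 'inaugural', 'wordnet', 'averaged_perceptron_tagger',
            'maxent_ne_chunker', 'words']:
    nltk.download(pkg)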

In [3]:
# Take a moment to explore what is in the nltk module
dir(nltk)


Out[3]:
['AbstractLazySequence',
 'AffixTagger',
 'AlignedSent',
 'Alignment',
 'AnnotationTask',
 'ApplicationExpression',
 'Assignment',
 'BigramAssocMeasures',
 'BigramCollocationFinder',
 'BigramTagger',
 'BinaryMaxentFeatureEncoding',
 'BlanklineTokenizer',
 'BllipParser',
 'BottomUpChartParser',
 'BottomUpLeftCornerChartParser',
 'BottomUpProbabilisticChartParser',
 'Boxer',
 'BrillTagger',
 'BrillTaggerTrainer',
 'CFG',
 'CRFTagger',
 'CfgReadingCommand',
 'ChartParser',
 'ChunkParserI',
 'ChunkScore',
 'ClassifierBasedPOSTagger',
 'ClassifierBasedTagger',
 'ClassifierI',
 'ConcordanceIndex',
 'ConditionalExponentialClassifier',
 'ConditionalFreqDist',
 'ConditionalProbDist',
 'ConditionalProbDistI',
 'ConfusionMatrix',
 'ContextIndex',
 'ContextTagger',
 'ContingencyMeasures',
 'CrossValidationProbDist',
 'DRS',
 'DecisionTreeClassifier',
 'DefaultTagger',
 'DependencyEvaluator',
 'DependencyGrammar',
 'DependencyGraph',
 'DependencyProduction',
 'DictionaryConditionalProbDist',
 'DictionaryProbDist',
 'DiscourseTester',
 'DrtExpression',
 'DrtGlueReadingCommand',
 'ELEProbDist',
 'EarleyChartParser',
 'Expression',
 'FStructure',
 'FeatDict',
 'FeatList',
 'FeatStruct',
 'FeatStructReader',
 'Feature',
 'FeatureBottomUpChartParser',
 'FeatureBottomUpLeftCornerChartParser',
 'FeatureChartParser',
 'FeatureEarleyChartParser',
 'FeatureIncrementalBottomUpChartParser',
 'FeatureIncrementalBottomUpLeftCornerChartParser',
 'FeatureIncrementalChartParser',
 'FeatureIncrementalTopDownChartParser',
 'FeatureTopDownChartParser',
 'FreqDist',
 'HTTPPasswordMgrWithDefaultRealm',
 'HeldoutProbDist',
 'HiddenMarkovModelTagger',
 'HiddenMarkovModelTrainer',
 'HunposTagger',
 'IBMModel',
 'IBMModel1',
 'IBMModel2',
 'IBMModel3',
 'IBMModel4',
 'IBMModel5',
 'ISRIStemmer',
 'ImmutableMultiParentedTree',
 'ImmutableParentedTree',
 'ImmutableProbabilisticMixIn',
 'ImmutableProbabilisticTree',
 'ImmutableTree',
 'IncrementalBottomUpChartParser',
 'IncrementalBottomUpLeftCornerChartParser',
 'IncrementalChartParser',
 'IncrementalLeftCornerChartParser',
 'IncrementalTopDownChartParser',
 'Index',
 'InsideChartParser',
 'JSONTaggedDecoder',
 'JSONTaggedEncoder',
 'KneserNeyProbDist',
 'LancasterStemmer',
 'LaplaceProbDist',
 'LazyConcatenation',
 'LazyEnumerate',
 'LazyMap',
 'LazySubsequence',
 'LazyZip',
 'LeftCornerChartParser',
 'LidstoneProbDist',
 'LineTokenizer',
 'LogicalExpressionException',
 'LongestChartParser',
 'MLEProbDist',
 'MWETokenizer',
 'Mace',
 'MaceCommand',
 'MaltParser',
 'MaxentClassifier',
 'Model',
 'MultiClassifierI',
 'MultiParentedTree',
 'MutableProbDist',
 'NaiveBayesClassifier',
 'NaiveBayesDependencyScorer',
 'NgramAssocMeasures',
 'NgramTagger',
 'NonprojectiveDependencyParser',
 'Nonterminal',
 'OrderedDict',
 'PCFG',
 'Paice',
 'ParallelProverBuilder',
 'ParallelProverBuilderCommand',
 'ParentedTree',
 'ParserI',
 'PerceptronTagger',
 'PhraseTable',
 'PorterStemmer',
 'PositiveNaiveBayesClassifier',
 'ProbDistI',
 'ProbabilisticDependencyGrammar',
 'ProbabilisticMixIn',
 'ProbabilisticNonprojectiveParser',
 'ProbabilisticProduction',
 'ProbabilisticProjectiveDependencyParser',
 'ProbabilisticTree',
 'Production',
 'ProjectiveDependencyParser',
 'Prover9',
 'Prover9Command',
 'ProxyBasicAuthHandler',
 'ProxyDigestAuthHandler',
 'ProxyHandler',
 'PunktSentenceTokenizer',
 'QuadgramCollocationFinder',
 'RSLPStemmer',
 'RTEFeatureExtractor',
 'RandomChartParser',
 'RangeFeature',
 'ReadingCommand',
 'RecursiveDescentParser',
 'RegexpChunkParser',
 'RegexpParser',
 'RegexpStemmer',
 'RegexpTagger',
 'RegexpTokenizer',
 'ResolutionProver',
 'ResolutionProverCommand',
 'SExprTokenizer',
 'SLASH',
 'Senna',
 'SennaChunkTagger',
 'SennaNERTagger',
 'SennaTagger',
 'SequentialBackoffTagger',
 'ShiftReduceParser',
 'SimpleGoodTuringProbDist',
 'SklearnClassifier',
 'SlashFeature',
 'SnowballStemmer',
 'SpaceTokenizer',
 'StackDecoder',
 'StanfordNERTagger',
 'StanfordPOSTagger',
 'StanfordSegmenter',
 'StanfordTagger',
 'StanfordTokenizer',
 'StemmerI',
 'SteppingChartParser',
 'SteppingRecursiveDescentParser',
 'SteppingShiftReduceParser',
 'TYPE',
 'TabTokenizer',
 'TableauProver',
 'TableauProverCommand',
 'TaggerI',
 'TestGrammar',
 'Text',
 'TextCat',
 'TextCollection',
 'TextTilingTokenizer',
 'TnT',
 'TokenSearcher',
 'TopDownChartParser',
 'TransitionParser',
 'Tree',
 'TreebankWordTokenizer',
 'Trie',
 'TrigramAssocMeasures',
 'TrigramCollocationFinder',
 'TrigramTagger',
 'TweetTokenizer',
 'TypedMaxentFeatureEncoding',
 'Undefined',
 'UniformProbDist',
 'UnigramTagger',
 'UnsortedChartParser',
 'Valuation',
 'Variable',
 'ViterbiParser',
 'WekaClassifier',
 'WhitespaceTokenizer',
 'WittenBellProbDist',
 'WordNetLemmatizer',
 'WordPunctTokenizer',
 '__author__',
 '__author_email__',
 '__builtins__',
 '__cached__',
 '__classifiers__',
 '__copyright__',
 '__doc__',
 '__file__',
 '__keywords__',
 '__license__',
 '__loader__',
 '__longdescr__',
 '__maintainer__',
 '__maintainer_email__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__url__',
 '__version__',
 'absolute_import',
 'accuracy',
 'add_logs',
 'agreement',
 'alignment_error_rate',
 'api',
 'app',
 'apply_features',
 'approxrand',
 'arity',
 'association',
 'bigrams',
 'binary_distance',
 'binary_search_file',
 'binding_ops',
 'bisect',
 'blankline_tokenize',
 'bleu',
 'bleu_score',
 'bllip',
 'boolean_ops',
 'boxer',
 'bracket_parse',
 'breadth_first',
 'brill',
 'brill_trainer',
 'build_opener',
 'call_megam',
 'casual',
 'casual_tokenize',
 'ccg',
 'chain',
 'chart',
 'chat',
 'choose',
 'chunk',
 'class_types',
 'classify',
 'clause',
 'clean_html',
 'clean_url',
 'cluster',
 'collocations',
 'combinations',
 'compat',
 'config_java',
 'config_megam',
 'config_weka',
 'conflicts',
 'confusionmatrix',
 'conllstr2tree',
 'conlltags2tree',
 'corpus',
 'crf',
 'custom_distance',
 'data',
 'decisiontree',
 'decorator',
 'decorators',
 'defaultdict',
 'demo',
 'dependencygraph',
 'deque',
 'discourse',
 'distance',
 'download',
 'download_gui',
 'download_shell',
 'downloader',
 'draw',
 'drt',
 'earleychart',
 'edit_distance',
 'elementtree_indent',
 'entropy',
 'equality_preds',
 'evaluate',
 'evaluate_sents',
 'everygrams',
 'extract_rels',
 'extract_test_sentences',
 'f_measure',
 'featstruct',
 'featurechart',
 'filestring',
 'flatten',
 'fractional_presence',
 'getproxies',
 'ghd',
 'glue',
 'grammar',
 'guess_encoding',
 'help',
 'hmm',
 'hunpos',
 'ibm1',
 'ibm2',
 'ibm3',
 'ibm4',
 'ibm5',
 'ibm_model',
 'ieerstr2tree',
 'in_idle',
 'induce_pcfg',
 'inference',
 'infile',
 'install_opener',
 'internals',
 'interpret_sents',
 'interval_distance',
 'invert_dict',
 'invert_graph',
 'is_rel',
 'islice',
 'isri',
 'jaccard_distance',
 'json_tags',
 'jsontags',
 'lancaster',
 'lazyimport',
 'lfg',
 'line_tokenize',
 'linearlogic',
 'load',
 'load_parser',
 'locale',
 'log_likelihood',
 'logic',
 'mace',
 'malt',
 'map_tag',
 'mapping',
 'masi_distance',
 'maxent',
 'megam',
 'memoize',
 'metrics',
 'misc',
 'mwe',
 'naivebayes',
 'ne_chunk',
 'ne_chunk_sents',
 'ngrams',
 'nonprojectivedependencyparser',
 'nonterminals',
 'numpy',
 'os',
 'pad_sequence',
 'paice',
 'parse',
 'parse_sents',
 'pchart',
 'perceptron',
 'pk',
 'porter',
 'pos_tag',
 'pos_tag_sents',
 'positivenaivebayes',
 'pprint',
 'pr',
 'precision',
 'presence',
 'print_function',
 'print_string',
 'probability',
 'projectivedependencyparser',
 'prover9',
 'punkt',
 'py25',
 'py26',
 'py27',
 'pydoc',
 'python_2_unicode_compatible',
 'raise_unorderable_types',
 'ranks_from_scores',
 'ranks_from_sequence',
 're',
 're_show',
 'read_grammar',
 'read_logic',
 'read_valuation',
 'recall',
 'recursivedescent',
 'regexp',
 'regexp_span_tokenize',
 'regexp_tokenize',
 'register_tag',
 'relextract',
 'resolution',
 'ribes',
 'ribes_score',
 'root_semrep',
 'rslp',
 'rte_classifier',
 'rte_classify',
 'rte_features',
 'rtuple',
 'scikitlearn',
 'scores',
 'segmentation',
 'sem',
 'senna',
 'sent_tokenize',
 'sequential',
 'set2rel',
 'set_proxy',
 'sexpr',
 'sexpr_tokenize',
 'shiftreduce',
 'simple',
 'sinica_parse',
 'six',
 'skipgrams',
 'skolemize',
 'slice_bounds',
 'snowball',
 'spearman',
 'spearman_correlation',
 'stack_decoder',
 'stanford',
 'stanford_segmenter',
 'stem',
 'str2tuple',
 'string_span_tokenize',
 'string_types',
 'subprocess',
 'subsumes',
 'sum_logs',
 'tableau',
 'tadm',
 'tag',
 'tagset_mapping',
 'tagstr2tree',
 'tbl',
 'text',
 'text_type',
 'textcat',
 'texttiling',
 'textwrap',
 'tkinter',
 'tnt',
 'tokenize',
 'tokenwrap',
 'toolbox',
 'total_ordering',
 'transitionparser',
 'transitive_closure',
 'translate',
 'tree',
 'tree2conllstr',
 'tree2conlltags',
 'treebank',
 'treetransforms',
 'trigrams',
 'tuple2str',
 'types',
 'unify',
 'unique_list',
 'untag',
 'usage',
 'util',
 'version_file',
 'version_info',
 'viterbi',
 'weka',
 'windowdiff',
 'word_tokenize',
 'wordnet',
 'wordpunct_tokenize',
 'wsd']

Methods for Working with Sample NLTK Corpora

To explore much of the built-in corpora, use the following methods:


In [5]:
# Note: nltk.corpus is loaded lazily, so dir() shows the lazy loader's
# attributes until a corpus is first accessed.
for name in dir(nltk.corpus):
    print(name)


_LazyModule__lazymodule_globals
_LazyModule__lazymodule_import
_LazyModule__lazymodule_init
_LazyModule__lazymodule_loaded
_LazyModule__lazymodule_locals
_LazyModule__lazymodule_name
__class__
__delattr__
__dict__
__dir__
__doc__
__eq__
__format__
__ge__
__getattr__
__getattribute__
__gt__
__hash__
__init__
__le__
__lt__
__module__
__name__
__ne__
__new__
__reduce__
__reduce_ex__
__repr__
__setattr__
__sizeof__
__str__
__subclasshook__
__weakref__

fileids()


In [7]:
# You can explore the titles with:
print(nltk.corpus.gutenberg.fileids())


['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

In [8]:
# For a specific corpus, list the fileids that are available:
print(nltk.corpus.shakespeare.fileids())


['a_and_c.xml', 'dream.xml', 'hamlet.xml', 'j_caesar.xml', 'macbeth.xml', 'merchant.xml', 'othello.xml', 'r_and_j.xml']

text.Text()

The nltk.text.Text class is a wrapper around a sequence of simple (string) tokens - intended only for the initial exploration of text, usually via the Python REPL. It has the following methods:

  • common_contexts
  • concordance
  • collocations
  • count
  • plot
  • findall
  • index

You shouldn't use this class in production-level systems, but it is useful for exploring (small) snippets of text in a meaningful fashion.

For example, you can get access to the text from Hamlet as follows:


In [9]:
hamlet = nltk.text.Text(nltk.corpus.gutenberg.words('shakespeare-hamlet.txt'))
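
Before moving on, here is a quick sketch of a couple of the other Text helpers listed above (collocations prints frequently co-occurring word pairs, and count returns a raw token count):

# A quick sketch of a few other Text convenience methods.
hamlet.collocations()            # prints frequent bigram collocations
print(hamlet.count("Hamlet"))    # raw frequency of a single token
print(hamlet.index("Denmarke"))  # position of the first occurrence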

concordance()

The concordance function performs a search for the given token and then also provides the surrounding context.


In [10]:
hamlet.concordance("king", 55, lines=10)


Displaying 10 of 172 matches:
elfe Bar . Long liue the King Fran . Barnardo ? Bar . 
e same figure , like the King that ' s dead Mar . Thou
. Lookes it not like the King ? Marke it Horatio Hora 
Mar . Is it not like the King ? Hor . As thou art to t
isper goes so : Our last King , Whose Image euen but n
mpetent Was gaged by our King : which had return ' d T
Secunda . Enter Claudius King of Denmarke , Gertrude t
elia , Lords Attendant . King . Though yet of Hamlet o
er To businesse with the King , more then the scope Of
 , will we shew our duty King . We doubt it nothing , 

similar()

Given some context surrounding a word, we can discover similar words, i.e. words that occur frequently in the same context and with a similar distribution (distributional similarity):

Note: ContextIndex.similar_words(word) calculates the similarity score for each word as the sum of the products of frequencies in each context. Text.similar() simply counts the number of unique contexts the words share.

http://bit.ly/2a2udIr


In [11]:
print(hamlet.similar("marriage"))
austen = nltk.text.Text(nltk.corpus.gutenberg.words("austen-sense.txt"))
print()
print(austen.similar("marriage"))


faculty that it paris loue wrath funerall together greefe time heauen
t thewes clouds reputation eare forme
None

mother family time life head brother side affection engagement
feelings judgment carriage manners ease daughters eyes regard sister
visit hand
None

As you can see, this takes a bit of time to build the index in memory, which is one of the reasons it's not suggested to use this class in production code.

common_contexts()

Now that we can do searching and similarity, we can find the common contexts of a set of words.


In [15]:
hamlet.common_contexts(["king", "father"])


a_that my_and

Your turn: go ahead and explore similar words and contexts - what does the common context mean?

dispersion_plot()

NLTK also uses matplotlib and pylab to display graphs and charts that can show dispersions and frequency. This is especially interesting for the corpus of inaugural addresses given by U.S. presidents.


In [16]:
inaugural = nltk.text.Text(nltk.corpus.inaugural.words())
inaugural.dispersion_plot(["citizens", "democracy", "freedom", "duty", "America"])


Stopwords


In [18]:
print(nltk.corpus.stopwords.fileids())
nltk.corpus.stopwords.words('english')

import string
print(string.punctuation)


['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

The corpus reader objects export several vital methods:

  • paras (iterate through each paragraph)
  • sents (iterate through each sentence)
  • words (iterate through each word)
  • raw (get access to the raw text)

paras()


In [19]:
corpus = nltk.corpus.brown
print(corpus.paras())


[[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']], [['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.']], ...]

sents()


In [20]:
print(corpus.sents())


[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

words()


In [15]:
print(corpus.words())


['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

raw()

Be careful!


In [16]:
print(corpus.raw()[:200]) # Be careful!



	The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' tha

Your turn! Explore some of the text in the available corpora

Frequency Analyses

In statistical machine learning approaches to NLP, the very first thing we need to do is count things - especially the unigrams that appear in the text and their relationships to each other. NLTK provides two excellent classes to enable these frequency analyses:

  • FreqDist
  • ConditionalFreqDist

And these two classes serve as the foundation for most of the probability and statistical analyses that we will conduct.

Zipf's Law

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation. Read more on Wikipedia.

First we will compute the following:

  • The count of words
  • The vocabulary (unique words)
  • The lexical diversity (the ratio of word count to vocabulary)

In [17]:
reuters = nltk.corpus.reuters # Corpus of news articles
counts  = nltk.FreqDist(reuters.words())
vocab   = len(counts.keys())
words   = sum(counts.values())
lexdiv  = float(words) / float(vocab)

print("Corpus has %i types and %i tokens for a lexical diversity of %0.3f" % (vocab, words, lexdiv))


Corpus has 41600 types and 1720901 tokens for a lexical diversity of 41.368
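
As a rough sanity check of Zipf's law, the product of rank and frequency should stay within the same order of magnitude as we walk down the frequency table. A quick sketch reusing the counts computed above:

# Sketch: under Zipf's law, rank * frequency is roughly constant.
for rank, (token, freq) in enumerate(counts.most_common(10), start=1):
    print("%2d %-6s freq=%7d rank*freq=%9d" % (rank, token, freq, rank * freq))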

counts.B()

The number of unique tokens (bins) in the frequency distribution - i.e. the size of the vocabulary.

In [18]:
counts.B()


Out[18]:
41600

most_common()

The n most common tokens in the corpus


In [19]:
print(counts.most_common(40))


[('.', 94687), (',', 72360), ('the', 58251), ('of', 35979), ('to', 34035), ('in', 26478), ('said', 25224), ('and', 25043), ('a', 23492), ('mln', 18037), ('vs', 14120), ('-', 13705), ('for', 12785), ('dlrs', 11730), ("'", 11272), ('The', 10968), ('000', 10277), ('1', 9977), ('s', 9298), ('pct', 9093), ('it', 8842), (';', 8762), ('&', 8698), ('lt', 8694), ('on', 8556), ('from', 7986), ('cts', 7953), ('is', 7580), ('>', 7449), ('that', 7377), ('its', 7265), ('by', 6872), ('"', 6816), ('at', 6537), ('2', 6528), ('U', 6388), ('S', 6382), ('year', 6310), ('be', 6288), ('with', 5945)]

counts.max()

The most frequent token in the corpus.


In [20]:
print(counts.max())


.

counts.hapaxes()

A list of all hapax legomena (words that only appear one time in the corpus).


In [21]:
print(counts.hapaxes()[0:10])


['72p', 'incurs', 'footwork', 'NEXL', 'arrogant', 'Formation', 'WJYE', 'simulation', 'auctioning', 'Chesapeake']

counts.freq()

The percentage of the corpus for the given token.


In [22]:
counts.freq('stipulate') * 100


Out[22]:
5.810909517746808e-05

counts.plot()

Plot the frequencies of the n most commonly occurring words.


In [23]:
counts.plot(50, cumulative=False)



In [24]:
# By setting cumulative to True, we can visualize the cumulative counts of the _n_ most common words.
counts.plot(50, cumulative=True)


ConditionalFreqDist()


In [24]:
from itertools import chain 

brown = nltk.corpus.brown
categories = brown.categories()

counts = nltk.ConditionalFreqDist(chain(*[[(cat, word) for word in brown.words(categories=cat)] for cat in categories]))

for category, dist in counts.items():
    vocab  = len(dist.keys())
    tokens = sum(dist.values())
    lexdiv = float(tokens) / float(vocab)
    print("%s: %i types with %i tokens and lexical diversity of %0.3f" % (category, vocab, tokens, lexdiv))


lore: 14503 types with 110299 tokens and lexical diversity of 7.605
religion: 6373 types with 39399 tokens and lexical diversity of 6.182
humor: 5017 types with 21695 tokens and lexical diversity of 4.324
hobbies: 11935 types with 82345 tokens and lexical diversity of 6.899
science_fiction: 3233 types with 14470 tokens and lexical diversity of 4.476
editorial: 9890 types with 61604 tokens and lexical diversity of 6.229
romance: 8452 types with 70022 tokens and lexical diversity of 8.285
belles_lettres: 18421 types with 173096 tokens and lexical diversity of 9.397
learned: 16859 types with 181888 tokens and lexical diversity of 10.789
fiction: 9302 types with 68488 tokens and lexical diversity of 7.363
mystery: 6982 types with 57169 tokens and lexical diversity of 8.188
reviews: 8626 types with 40704 tokens and lexical diversity of 4.719
government: 8181 types with 70117 tokens and lexical diversity of 8.571
adventure: 8874 types with 69342 tokens and lexical diversity of 7.814
news: 14394 types with 100554 tokens and lexical diversity of 6.986
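
Indexing a ConditionalFreqDist by one of its conditions returns a plain FreqDist, so all of the per-token methods from above still apply. A quick sketch:

# Sketch: the distribution for a single condition is just a FreqDist.
print(counts['news'].most_common(10))
print(counts['news'].freq('the'))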

Your turn: compute the conditional frequency distribution of bigrams in a corpus

Hint:


In [25]:
for ngram in nltk.ngrams(["The", "bear", "walked", "in", "the", "woods", "at", "midnight"], 5):
    print(ngram)


('The', 'bear', 'walked', 'in', 'the')
('bear', 'walked', 'in', 'the', 'woods')
('walked', 'in', 'the', 'woods', 'at')
('in', 'the', 'woods', 'at', 'midnight')
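
One possible approach (a sketch that conditions each bigram's second word on its first, using the brown corpus loaded above):

# Sketch: condition the second word of each bigram on the first.
bigram_counts = nltk.ConditionalFreqDist(
    (first, second) for first, second in nltk.bigrams(brown.words(categories='news'))
)
print(bigram_counts['the'].most_common(5))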

Preprocessing Text

NLTK is great at preprocessing raw text - it provides the following tools for dividing text into its constituent parts:

  • sent_tokenize: a Punkt sentence tokenizer:

    This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

    However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text (see the sketch after this list).

  • word_tokenize: a Treebank tokenizer

    The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().

  • pos_tag: a maximum entropy tagger trained on the Penn Treebank

    There are several other taggers, including (notably) the BrillTagger, as well as the BrillTaggerTrainer to train your own tagger or tag set.

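As mentioned above, you can train your own Punkt model. A brief sketch, where the training sample (here, the raw text of Emma) stands in for whatever plain text best matches your target domain:

from nltk.tokenize.punkt import PunktSentenceTokenizer

# Illustrative training sample - substitute plain text from your own domain.
training_text = nltk.corpus.gutenberg.raw('austen-emma.txt')

# Punkt learns abbreviations, collocations, and sentence starters unsupervised.
custom_tokenizer = PunktSentenceTokenizer(training_text)
print(custom_tokenizer.tokenize("Mr. Knightley called today. He stayed for tea."))

The next cell pulls the main content out of a saved HTML page (using BeautifulSoup and the readability-lxml package) so that we have some realistic raw text to preprocess.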

In [26]:
import bs4
from readability.readability import Document

# Tags to extract as paragraphs from the HTML text
TAGS = [
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li'
]

def read_html(path):
    with open(path, 'r') as f:

        # Transform the document into a readability paper summary
        html = Document(f.read()).summary()

        # Parse the HTML using BeautifulSoup
        soup = bs4.BeautifulSoup(html, "lxml")

        # Extract the paragraph delimiting elements
        for tag in soup.find_all(TAGS):

            # Get the HTML node text
            yield tag.get_text()

In [27]:
for paragraph in read_html('fixtures/nrRB0.html'):
    print(paragraph + "\n")



 It’s lowbrow. It’s messy. It could never be accused of being healthful. But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken. Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings. Here are some of the most irresistible. 



‘Rotissi-fried’ chicken at the Partisan

Forget the cronut. Our newest favorite food chimera is the “rotissi-fried” chicken at the Partisan. Credit goes to chef Nate Anda, who dreamed up the dish: After a 12-hour brine, the chicken is rotisseried for two hours and then fried for two and a half minutes. Why both? “Everything is better once it’s fried in beef fat,” Anda said. We have to agree. Whether white or dark, the meat is succulent throughout. The batter-free frying leaves the simply seasoned skin rendered perfectly crisp, golden and translucent — cracklings, essentially. The sound of it shattering under the knife was music to our ears. And as if the lily needed further gilding, the chicken comes with a generous pour of honey hot sauce. The sauce is hard to resist, but try to reserve a few bites of unadorned chicken so you can fully appreciate this happy marriage of classic preparations.

 The Partisan, 709 D St. NW. 202-524-5322. www.thepartisandc.com.  

— Becky Krystal

Traditional fried chicken at Family Meal

When Bryan Voltaggio started planning the menu for Family Meal, his modern, upscale spin on a diner, his thoughts turned to home and the carryout meal he most enjoyed as a kid: fried chicken. “It was one of our favorite things,” he says. “It just seems like a family dinner.” And if he was creating a restaurant called Family Meal, fried chicken “had to be an important part of it.” But Voltaggio wanted to do it right, and set about testing cooking methods, brines, breadings and fryers. That long process paid off with a home run of a fried chicken dish that’s become the most popular item on Family Meal’s menu. The whole chickens spend 12 hours in a brine of pickle juice and roasted poultry stock before getting dredged, rested and dredged again in a mixture of flour, cornmeal and corn starch. After a dip in the top-of-the-line pressure fryer, the thighs, legs and breasts emerge with a crisp, salty skin that cracks open to reveal wonderfully warm and moist flesh. You don’t even need to dunk it in the house-made hot sauce that accompanies the dish, but really, who can resist? 

 Family Meal Ashburn, 20462 Exchange St. 703-726-9800; Frederick, 880 N. East St. 301-378-2895; Baltimore, 621 E. Pratt St. 410-601-3242. www.voltfamilymeal.com.  

— John Taylor

Japanese fried chicken at Izakaya Seki

Although it’s commonly served in Japan at karaoke bars, convenience stores and on street carts, kara age chicken — like most of the country’s food — is held to an extremely high standard. “It’s taken it to the nth degree of obsession and detail,” says Cizuka Seki, who, with her father Hiroshi, owns Izakaya Seki on V Street NW. “Kara age” is used to describe the method for deep-frying bite-size pieces of fish and, more commonly, chicken. Though there are subtle variations on the ubiquitous dish, most recipes call for chicken thighs marinated in soy sauce, coated in flour or corn starch and deep-fried in oil. Izakaya Seki’s version sticks closely to the formula. Probably. “I’m not even quite sure what my dad puts into it, because we don’t have recipes,” Seki says, though she’s certain wheat flour is involved. The result is a thin, tender coating that’s slightly softer than tempura. The accompanying ponzu sauce lends a tartness to the nubs.

 Izakaya Seki, 1117 V St. NW. 202-588-5841. www.sekidc.com.  

— Holley Simmons

Korean fried chicken at BonChon

Don’t waste your kimchi-stinking breath asking for more sauce at BonChon. The South Korean fried chicken chain, founded in 2002, is so dedicated to consistency that it doesn’t allow for any modifications. And why would you want to change anything, really? The made-to-order wings, drumsticks and strips are fried twice, resulting in a paper-thin crust that yields the same satisfying crack as shattering crème brulee with a spoon. Founder Jinduk Seh spent two years perfecting his secret sauces, which come in three flavors — soy garlic, hot and a blend of the two — and are brushed on by hand post-fry, piece by piece. True to BonChon’s commitment to uniformity, sauces are made exclusively in South Korea and distributed to all 140-plus BonChon locations, which means the wings you’re chewing on in Arlington are slathered with the same exact stuff as those in the Philippines. Joints like these are so common throughout Korea they’re called “chimeks,” which is a hybrid term that combines “chicken” with the Korean word for beer. Washington should be happy to have 10 BonChons within driving distance, plus a brand new Metro-accessible location near the Navy Yard.

 BonChon, 1015 Half St. SE and nine other locations in Maryland and Virginia. www.bonchon.com.  

— Holley Simmons

Maryland fried chicken at Crisfield Seafood and Hank’s Oyster Bar



There’s not much agreement on what constitutes Maryland fried chicken. Some say it’s just a fresh Maryland chicken that’s pan-fried; others say it should be topped with white gravy, almost like a chicken-fried steak. The pan-fried chicken platter at  Crisfield Seafood is a perfect example of the former style. Half of a chicken is dredged in flour, dusted with salt and pepper, and fried in a cast-iron pan. This preparation lends a snap and crunch to the exterior, and while the meat falls off the bone, the well-seasoned breading holds on. (The chicken is available only Friday through Sunday, and frequently sells out.) The Chesapeake fried chicken at Hank’s Oyster Bar in Dupont Circle and Capitol Hill is plumper than Crisfield’s version and seasoned with Old Bay, black pepper and cayenne, but the breading is softer and less crispy. It’s brined for 24 hours and deep-fried, rather than pan-fried, and it’s served only on Sunday. 

 Crisfield Seafood, 8012 Georgia Ave., Silver Spring. 301-589-1306. www.crisfieldseafood.com. Hank’s Oyster Bar, 1624 Q St. NW. 202-462-4265; 633 Pennsylvania Ave. SE. 202-733-1971. www.hanksoysterbar.com. 

— Fritz Hahn

Fancy fried chicken at Central

Self-consciousness may prevent you from ordering fried chicken in a white-tablecloth restaurant. It feels incongruous — gauche, almost — to dig into picnic fare at the kind of place where you should be ordering risotto or tartare or something that comes with mousse, gelee or foam. But you have to override that adult voice in the back of your head, because if you don’t, you’ll miss out on Central Michel Richard’s famed fried chicken plate ($24 at lunch, $25 for dinner), which remains as good as ever. Though it’s no longer sold by the bucket to go, Michel Richard’s KFC-inspired crispy breast and thigh come stacked atop a pool of the butteriest mashed potatoes you’ll ever taste. It’s such a dignified presentation that this most American of dishes almost could pass as (gasp!) French. Self-consciousness should, however, steer you toward using a knife and fork — not your fingers — to eat the chicken. It is, after all, that kind of a place. 

 Central Michel Richard, 1001 Pennsylvania Ave. NW. 202-626-0015. www.centralmichelrichard.com .   



— Maura Judkis



Nashville hot chicken at Reserve 2216

If you believe the lore, Nashville hot chicken was basically a crime of passion, created as a blistering rebuke to a no-account Romeo who couldn’t keep his hands off other women. Alas, this wolf was also a chili head who found pleasure, not pain, in this dish of revenge served hot. Decades later, chefs are latching onto this addictive form of punishment. Aaron Silverman served an ultra-refined version at Rose’s Luxury for months, and now Eric Reid, chef and co-owner of Reserve 2216 in Del Ray, has developed his own take on hot chicken, even if he’s never actually enjoyed it in Nashville. He marinates an airline cut (boneless breast with the drumette wing attached) in buttermilk and Crystal hot sauce before dredging the chicken in seasoned flour and frying it. Reid ditches the traditional white bread base in favor of collards and a side of corn bread waffles. He finishes the dish with a combination of Cajun seasonings and more Crystal hot sauce for a moist, crispy bird that bites back. But not too hard. This is Alexandria, after all. 

 Reserve 2216, 2216 Mount Vernon Ave., Alexandria. 703-549-2889. www.drpreserve.com. 

— Tim Carman



Fast-food fried chicken at Popeyes

The sole virtue of most fast-food operations is consistency. Whether you bite into a Big Mac in Bethesda or Beijing, the sandwich should taste the same. The menu at Popeyes follows suit, but it deviates from the brand-name competition in an important respect: The signature at Popeyes could pass for home cooking (well, if your home had a vat of clean, hot oil and a person with a Southern accent tending the meal). Maybe that accounts for my occasional forays to the chicken fryer after a bum restaurant-review excursion. No matter where I eat my order, inevitably “spicy,” I know I can count on a coating that smacks of cayenne, paprika and even crushed cornflakes, and chicken that spurts with juice. The shatter is audible; the golden crumbs fly everywhere, but end up on my tongue. No one-trick pony, Popeyes has hot and tender buttermilk biscuits that bolster my favorite excuse to snack low on the food chain. Once, I got home to discover a clerk had forgotten to pack bread in my bag. I almost cried. Instead, I consoled myself with another piece of chicken. 

 Popeyes has locations throughout the D.C. metro area. www.popeyes.com. 

— Tom Sietsema



Fried chicken sandwich at DCity Smokehouse

The fried chicken sandwich hasn’t been the same since KFC’s Double Down turned a guilty pleasure into an outright farce, using crispy white-meat fillets as both the sandwich’s primary protein and the oily handles by which you eat the monstrosity. Leave it to Rob Sonderman, pitmaster and co-owner of DCity Smokehouse, to bring dignity back to the bite. His Den-Den — named for co-creator and pitmaster-in-training Dennis Geddie — begins with boneless thighs marinated in buttermilk, hot sauce and honey. Sonderman then dredges the meat in seasoned flour before dropping the thighs into the fryer. Generously stuffed into a grilled hoagie roll with lettuce, tomato and crispy onions, the chicken is finished with two sauces, including a house-made cilantro ranch. Technically, the Den-Den ($9.25) is one of the few smokeless items on DCity’s menu (unless you count the chipotle peppers in the hot sauce). No matter. You won’t care the minute you sink your teeth into that heaping hoagie of spicy thigh meat.

 DCity Smokehouse, 8 Florida Ave. NW. 202-733-1919. www.dcitysmokehouse.com.  

— Tim Carman



Classic D.C. fried chicken at Oohh’s and Aahh’s

Hearty is the appetite that can handle Oohh’s and Aahh’s chef-owner Oji Abbott’s boneless fried chicken breast without taking home leftovers. He buys local — from Hartman Meat Co. in Northeast Washington — and butterflies each 14-ounce portion ($12.95), which results in a lot of real estate for the crispy, well-seasoned coating. Abbott chalks up the consistently moist meat to proper cooking time and temperature, and to the recipe he learned from his grandmother.

 Oohh’s and Aahh’s, 1005 U St. NW. 202-667-7142. www.oohhsnaahhs.com. 

— Bonnie S. Benwick



Popcorn fried chicken at Pop’s Sea Bar

It’s all too easy to chomp through an order of Boardwalk Chicken at the shore-happy Pop’s Sea Bar in Adams Morgan. The bite-size pieces of dark-meat-only bird are brined for two hours, then treated to a buttermilk bath until they are coated with plain flour and flash-fried to order. A generous hand with salt and pepper just before serving means the effect of that seasoning builds as you empty the single-serving basket ($8.99). Ask for two portions of the accompanying Jersey sauce so you’ll have enough of its horseradish-y, kitchen-sink blend for every bite.

 Pop’s Sea Bar, 1817 Columbia Rd. NW. 202-534-3933. www.popsseabar.com. 

— Bonnie S. Benwick



Fried chicken tenders at GBD

Chicken tenders are often relegated to the children’s menu, but the chicken Tendies at GBD are a fine meal for adults and children alike. Each white-meat tender comes with a dark, crispy outer layer with a heavy dose of salt and spice. But the best part about chicken tenders is the dipping, and GBD offers nine sauces, including a take on D.C.’s own mumbo sauce, buttermilk ranch, chipotle barbecue and Frankenbutter, which combines Frank’s RedHot sauce with butter. Ask for the $5.50 Saucetown option to try all nine. 

 GBD, 1323 Connecticut Ave. NW. 202-524-5210. www.gbdchickendoughnuts.com. 

— Margaret Ely



Fried chicken skins at Gypsy Soul

Fried chicken fans can argue the merits of white meat vs. dark meat, or whether it’s better to chow down on a drumstick or the breast. But one thing we all can agree on is that the outer layer — the breading and the skin — is the most important element of a memorable piece of fried chicken. And sometimes you just want to savor the flavor of the skin — deeply spiced, perfect crunch — without filling up on meat or having to deal with bones. And that is when you grab one of the bar stools at R.J. Cooper’s Gypsy Soul in Fairfax’s Mosaic District. Cooper’s chicken skins ($9) are twisted slivers and shards, decadently salty and crackling with paprika, cayenne pepper and garlic. The dish arrives with a house-made “roof top honey-snake oil,” but it’s best to let these beauties shine on their own. 

 Gypsy Soul, 8296 Glass Alley, Fairfax. 703-992-0933. www.gypsysoul-va.com. 

— Fritz Hahn


In [28]:
text = u"Medical personnel returning to New York and New Jersey from the Ebola-riddled countries in West Africa will be automatically quarantined if they had direct contact with an infected person, officials announced Friday. New York Gov. Andrew Cuomo (D) and New Jersey Gov. Chris Christie (R) announced the decision at a joint news conference Friday at 7 World Trade Center. “We have to do more,” Cuomo said. “It’s too serious of a situation to leave it to the honor system of compliance.” They said that public-health officials at John F. Kennedy and Newark Liberty international airports, where enhanced screening for Ebola is taking place, would make the determination on who would be quarantined. Anyone who had direct contact with an Ebola patient in Liberia, Sierra Leone or Guinea will be quarantined. In addition, anyone who traveled there but had no such contact would be actively monitored and possibly quarantined, authorities said. This news came a day after a doctor who had treated Ebola patients in Guinea was diagnosed in Manhattan, becoming the fourth person diagnosed with the virus in the United States and the first outside of Dallas. And the decision came not long after a health-care worker who had treated Ebola patients arrived at Newark, one of five airports where people traveling from West Africa to the United States are encountering the stricter screening rules."

for sent in nltk.sent_tokenize(text): 
    print(sent)
    print()


Medical personnel returning to New York and New Jersey from the Ebola-riddled countries in West Africa will be automatically quarantined if they had direct contact with an infected person, officials announced Friday.

New York Gov.

Andrew Cuomo (D) and New Jersey Gov.

Chris Christie (R) announced the decision at a joint news conference Friday at 7 World Trade Center.

“We have to do more,” Cuomo said.

“It’s too serious of a situation to leave it to the honor system of compliance.” They said that public-health officials at John F. Kennedy and Newark Liberty international airports, where enhanced screening for Ebola is taking place, would make the determination on who would be quarantined.

Anyone who had direct contact with an Ebola patient in Liberia, Sierra Leone or Guinea will be quarantined.

In addition, anyone who traveled there but had no such contact would be actively monitored and possibly quarantined, authorities said.

This news came a day after a doctor who had treated Ebola patients in Guinea was diagnosed in Manhattan, becoming the fourth person diagnosed with the virus in the United States and the first outside of Dallas.

And the decision came not long after a health-care worker who had treated Ebola patients arrived at Newark, one of five airports where people traveling from West Africa to the United States are encountering the stricter screening rules.


In [29]:
for sent in nltk.sent_tokenize(text):
    print(list(nltk.wordpunct_tokenize(sent)))
    print()


['Medical', 'personnel', 'returning', 'to', 'New', 'York', 'and', 'New', 'Jersey', 'from', 'the', 'Ebola', '-', 'riddled', 'countries', 'in', 'West', 'Africa', 'will', 'be', 'automatically', 'quarantined', 'if', 'they', 'had', 'direct', 'contact', 'with', 'an', 'infected', 'person', ',', 'officials', 'announced', 'Friday', '.']

['New', 'York', 'Gov', '.']

['Andrew', 'Cuomo', '(', 'D', ')', 'and', 'New', 'Jersey', 'Gov', '.']

['Chris', 'Christie', '(', 'R', ')', 'announced', 'the', 'decision', 'at', 'a', 'joint', 'news', 'conference', 'Friday', 'at', '7', 'World', 'Trade', 'Center', '.']

['“', 'We', 'have', 'to', 'do', 'more', ',”', 'Cuomo', 'said', '.']

['“', 'It', '’', 's', 'too', 'serious', 'of', 'a', 'situation', 'to', 'leave', 'it', 'to', 'the', 'honor', 'system', 'of', 'compliance', '.”', 'They', 'said', 'that', 'public', '-', 'health', 'officials', 'at', 'John', 'F', '.', 'Kennedy', 'and', 'Newark', 'Liberty', 'international', 'airports', ',', 'where', 'enhanced', 'screening', 'for', 'Ebola', 'is', 'taking', 'place', ',', 'would', 'make', 'the', 'determination', 'on', 'who', 'would', 'be', 'quarantined', '.']

['Anyone', 'who', 'had', 'direct', 'contact', 'with', 'an', 'Ebola', 'patient', 'in', 'Liberia', ',', 'Sierra', 'Leone', 'or', 'Guinea', 'will', 'be', 'quarantined', '.']

['In', 'addition', ',', 'anyone', 'who', 'traveled', 'there', 'but', 'had', 'no', 'such', 'contact', 'would', 'be', 'actively', 'monitored', 'and', 'possibly', 'quarantined', ',', 'authorities', 'said', '.']

['This', 'news', 'came', 'a', 'day', 'after', 'a', 'doctor', 'who', 'had', 'treated', 'Ebola', 'patients', 'in', 'Guinea', 'was', 'diagnosed', 'in', 'Manhattan', ',', 'becoming', 'the', 'fourth', 'person', 'diagnosed', 'with', 'the', 'virus', 'in', 'the', 'United', 'States', 'and', 'the', 'first', 'outside', 'of', 'Dallas', '.']

['And', 'the', 'decision', 'came', 'not', 'long', 'after', 'a', 'health', '-', 'care', 'worker', 'who', 'had', 'treated', 'Ebola', 'patients', 'arrived', 'at', 'Newark', ',', 'one', 'of', 'five', 'airports', 'where', 'people', 'traveling', 'from', 'West', 'Africa', 'to', 'the', 'United', 'States', 'are', 'encountering', 'the', 'stricter', 'screening', 'rules', '.']


In [30]:
for sent in nltk.sent_tokenize(text):
    print(list(nltk.pos_tag(nltk.word_tokenize(sent))))
    print()


[('Medical', 'JJ'), ('personnel', 'NNS'), ('returning', 'VBG'), ('to', 'TO'), ('New', 'NNP'), ('York', 'NNP'), ('and', 'CC'), ('New', 'NNP'), ('Jersey', 'NNP'), ('from', 'IN'), ('the', 'DT'), ('Ebola-riddled', 'JJ'), ('countries', 'NNS'), ('in', 'IN'), ('West', 'NNP'), ('Africa', 'NNP'), ('will', 'MD'), ('be', 'VB'), ('automatically', 'RB'), ('quarantined', 'VBN'), ('if', 'IN'), ('they', 'PRP'), ('had', 'VBD'), ('direct', 'JJ'), ('contact', 'NN'), ('with', 'IN'), ('an', 'DT'), ('infected', 'JJ'), ('person', 'NN'), (',', ','), ('officials', 'NNS'), ('announced', 'VBD'), ('Friday', 'NNP'), ('.', '.')]

[('New', 'NNP'), ('York', 'NNP'), ('Gov', 'NNP'), ('.', '.')]

[('Andrew', 'NNP'), ('Cuomo', 'NNP'), ('(', '('), ('D', 'NNP'), (')', ')'), ('and', 'CC'), ('New', 'NNP'), ('Jersey', 'NNP'), ('Gov', 'NNP'), ('.', '.')]

[('Chris', 'NNP'), ('Christie', 'NNP'), ('(', '('), ('R', 'NNP'), (')', ')'), ('announced', 'VBD'), ('the', 'DT'), ('decision', 'NN'), ('at', 'IN'), ('a', 'DT'), ('joint', 'JJ'), ('news', 'NN'), ('conference', 'NN'), ('Friday', 'NNP'), ('at', 'IN'), ('7', 'CD'), ('World', 'NNP'), ('Trade', 'NNP'), ('Center', 'NNP'), ('.', '.')]

[('“We', 'NNS'), ('have', 'VBP'), ('to', 'TO'), ('do', 'VB'), ('more', 'JJR'), (',', ','), ('”', 'NNP'), ('Cuomo', 'NNP'), ('said', 'VBD'), ('.', '.')]

[('“It’s', 'VB'), ('too', 'RB'), ('serious', 'JJ'), ('of', 'IN'), ('a', 'DT'), ('situation', 'NN'), ('to', 'TO'), ('leave', 'VB'), ('it', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('honor', 'NN'), ('system', 'NN'), ('of', 'IN'), ('compliance.”', 'NN'), ('They', 'PRP'), ('said', 'VBD'), ('that', 'IN'), ('public-health', 'NN'), ('officials', 'NNS'), ('at', 'IN'), ('John', 'NNP'), ('F.', 'NNP'), ('Kennedy', 'NNP'), ('and', 'CC'), ('Newark', 'NNP'), ('Liberty', 'NNP'), ('international', 'JJ'), ('airports', 'NNS'), (',', ','), ('where', 'WRB'), ('enhanced', 'VBN'), ('screening', 'NN'), ('for', 'IN'), ('Ebola', 'NNP'), ('is', 'VBZ'), ('taking', 'VBG'), ('place', 'NN'), (',', ','), ('would', 'MD'), ('make', 'VB'), ('the', 'DT'), ('determination', 'NN'), ('on', 'IN'), ('who', 'WP'), ('would', 'MD'), ('be', 'VB'), ('quarantined', 'VBN'), ('.', '.')]

[('Anyone', 'NN'), ('who', 'WP'), ('had', 'VBD'), ('direct', 'JJ'), ('contact', 'NN'), ('with', 'IN'), ('an', 'DT'), ('Ebola', 'NNP'), ('patient', 'NN'), ('in', 'IN'), ('Liberia', 'NNP'), (',', ','), ('Sierra', 'NNP'), ('Leone', 'NNP'), ('or', 'CC'), ('Guinea', 'NNP'), ('will', 'MD'), ('be', 'VB'), ('quarantined', 'VBN'), ('.', '.')]

[('In', 'IN'), ('addition', 'NN'), (',', ','), ('anyone', 'NN'), ('who', 'WP'), ('traveled', 'VBD'), ('there', 'RB'), ('but', 'CC'), ('had', 'VBD'), ('no', 'DT'), ('such', 'JJ'), ('contact', 'NN'), ('would', 'MD'), ('be', 'VB'), ('actively', 'RB'), ('monitored', 'VBN'), ('and', 'CC'), ('possibly', 'RB'), ('quarantined', 'VBD'), (',', ','), ('authorities', 'NNS'), ('said', 'VBD'), ('.', '.')]

[('This', 'DT'), ('news', 'NN'), ('came', 'VBD'), ('a', 'DT'), ('day', 'NN'), ('after', 'IN'), ('a', 'DT'), ('doctor', 'NN'), ('who', 'WP'), ('had', 'VBD'), ('treated', 'VBN'), ('Ebola', 'NNP'), ('patients', 'NNS'), ('in', 'IN'), ('Guinea', 'NNP'), ('was', 'VBD'), ('diagnosed', 'VBN'), ('in', 'IN'), ('Manhattan', 'NNP'), (',', ','), ('becoming', 'VBG'), ('the', 'DT'), ('fourth', 'JJ'), ('person', 'NN'), ('diagnosed', 'VBD'), ('with', 'IN'), ('the', 'DT'), ('virus', 'NN'), ('in', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('and', 'CC'), ('the', 'DT'), ('first', 'JJ'), ('outside', 'NN'), ('of', 'IN'), ('Dallas', 'NNP'), ('.', '.')]

[('And', 'CC'), ('the', 'DT'), ('decision', 'NN'), ('came', 'VBD'), ('not', 'RB'), ('long', 'RB'), ('after', 'IN'), ('a', 'DT'), ('health-care', 'JJ'), ('worker', 'NN'), ('who', 'WP'), ('had', 'VBD'), ('treated', 'VBN'), ('Ebola', 'NNP'), ('patients', 'NNS'), ('arrived', 'VBD'), ('at', 'IN'), ('Newark', 'NNP'), (',', ','), ('one', 'CD'), ('of', 'IN'), ('five', 'CD'), ('airports', 'NNS'), ('where', 'WRB'), ('people', 'NNS'), ('traveling', 'VBG'), ('from', 'IN'), ('West', 'NNP'), ('Africa', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('are', 'VBP'), ('encountering', 'VBG'), ('the', 'DT'), ('stricter', 'NN'), ('screening', 'NN'), ('rules', 'NNS'), ('.', '.')]

All of these taggers work pretty well - but you can (and should) train them on your own corpora.

Stemming and Lemmatization

We have an immense number of word forms, as you can see from the various counts in the FreqDist above - for many applications (especially search) it is helpful to normalize these word forms into some canonical form for further exploration. In English (and many other languages), morphological affixes indicate gender, tense, quantity, etc., but these subtleties might not be necessary:

Stemming = chop off affixes to get the root stem of the word:

running --> run
flowers --> flower
geese   --> geese 

Lemmatization = look up word form in a lexicon to get canonical lemma

women   --> woman
foxes   --> fox
sheep   --> sheep

There are several stemmers available:

- Lancaster (English, newer and aggressive)
- Porter (English, original stemmer)
- Snowball (Many languages, newest)


The Lemmatizer uses the WordNet lexicon


In [24]:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer

text = list(nltk.word_tokenize("The women running in the fog passed bunnies working as computer scientists."))

snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
porter = PorterStemmer()

for stemmer in (snowball, lancaster, porter):
    stemmed_text = [stemmer.stem(t) for t in text]
    print(" ".join(stemmed_text))


the women run in the fog pass bunni work as comput scientist .
the wom run in the fog pass bunny work as comput sci .
The women run in the fog pass bunni work as comput scientist .

In [59]:
from nltk.stem.wordnet import WordNetLemmatizer

# Note: for better results, pass the part of speech tag - we'll see this in machine learning!
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in text]
print(" ".join(lemmas))


The woman running in the fog passed bunny working a computer scientist .

Note that the lemmatizer has to load the WordNet corpus which takes a bit.

Typical normalization of text for use as features in machine learning models looks something like this:


In [28]:
import string
from nltk.corpus import wordnet as wn

## Module constants
lemmatizer  = WordNetLemmatizer()
stopwords   = set(nltk.corpus.stopwords.words('english'))
punctuation = string.punctuation

def tagwn(tag):
    """
    Returns the WordNet tag from the Penn Treebank tag.
    """

    return {
        'N': wn.NOUN,
        'V': wn.VERB,
        'R': wn.ADV,
        'J': wn.ADJ
    }.get(tag[0], wn.NOUN)


def normalize(text):
    for token, tag in nltk.pos_tag(nltk.wordpunct_tokenize(text)):
        #if you're going to do part of speech tagging, do it here
        token = token.lower()
        if token in stopwords or token in punctuation:
            continue
        token = lemmatizer.lemmatize(token, tagwn(tag))
        yield token

print(list(normalize("The eagle flies at midnight.")))


['eagle', 'fly', 'midnight']

Named Entity Recognition

NLTK has an excellent MaxEnt-backed Named Entity Recognizer that ships with a pretrained model (used by ne_chunk). You can also retrain the chunker if you'd like - the code is very readable if you want to extend it with a gazetteer or otherwise.


In [36]:
print(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize("John Smith is from the United States of America and works at Microsoft Research Labs"))))


(S
  (PERSON John/NNP)
  (PERSON Smith/NNP)
  is/VBZ
  from/IN
  the/DT
  (GPE United/NNP States/NNPS)
  of/IN
  (GPE America/NNP)
  and/CC
  works/VBZ
  at/IN
  (ORGANIZATION Microsoft/NNP Research/NNP Labs/NNP))

You can also wrap the Stanford NER system, which many of you are also probably used to using.


In [21]:
import os
from nltk.tag import StanfordNERTagger

# change the paths below to point to wherever you unzipped the Stanford NER download file
stanford_root = '/Users/benjamin/Development/stanford-ner-2014-01-04'
stanford_data = os.path.join(stanford_root, 'classifiers/english.all.3class.distsim.crf.ser.gz')
stanford_jar  = os.path.join(stanford_root, 'stanford-ner-2014-01-04.jar')

st = StanfordNERTagger(stanford_data, stanford_jar, 'utf-8')
for i in st.tag("John Smith is from the United States of America and works at Microsoft Research Labs".split()):
    print('[' + i[1] + '] ' + i[0])


[PERSON] John
[PERSON] Smith
[O] is
[O] from
[O] the
[LOCATION] United
[LOCATION] States
[LOCATION] of
[LOCATION] America
[O] and
[O] works
[O] at
[ORGANIZATION] Microsoft
[ORGANIZATION] Research
[ORGANIZATION] Labs

Parsing

Parsing is a difficult NLP task due to structural ambiguities in text. As the length of sentences increases, so does the number of possible trees.


In [31]:
for name in dir(nltk.parse):
    if not name.startswith('_'): print(name)


BllipParser
BottomUpChartParser
BottomUpLeftCornerChartParser
BottomUpProbabilisticChartParser
ChartParser
DependencyEvaluator
DependencyGraph
EarleyChartParser
FeatureBottomUpChartParser
FeatureBottomUpLeftCornerChartParser
FeatureChartParser
FeatureEarleyChartParser
FeatureIncrementalBottomUpChartParser
FeatureIncrementalBottomUpLeftCornerChartParser
FeatureIncrementalChartParser
FeatureIncrementalTopDownChartParser
FeatureTopDownChartParser
IncrementalBottomUpChartParser
IncrementalBottomUpLeftCornerChartParser
IncrementalChartParser
IncrementalLeftCornerChartParser
IncrementalTopDownChartParser
InsideChartParser
LeftCornerChartParser
LongestChartParser
MaltParser
NaiveBayesDependencyScorer
NonprojectiveDependencyParser
ParserI
ProbabilisticNonprojectiveParser
ProbabilisticProjectiveDependencyParser
ProjectiveDependencyParser
RandomChartParser
RecursiveDescentParser
ShiftReduceParser
SteppingChartParser
SteppingRecursiveDescentParser
SteppingShiftReduceParser
TestGrammar
TopDownChartParser
TransitionParser
UnsortedChartParser
ViterbiParser
api
bllip
chart
dependencygraph
earleychart
evaluate
extract_test_sentences
featurechart
load_parser
malt
nonprojectivedependencyparser
pchart
projectivedependencyparser
recursivedescent
shiftreduce
transitionparser
util
viterbi

Similar to how you might write a compiler or an interpreter, parsing starts with a grammar that defines the construction of phrases and terminal entities.


In [51]:
grammar = nltk.grammar.CFG.fromstring("""

S -> NP PUNCT | NP
NP -> N N | ADJP NP | DET N | DET ADJP
ADJP -> ADJ NP | ADJ N

DET -> 'an' | 'the' | 'a' | 'that'
N -> 'airplane' | 'runway' | 'lawn' | 'chair' | 'person' 
ADJ -> 'red' | 'slow' | 'tired' | 'long'
PUNCT -> '.'
""")

In [60]:
def parse(sent):
    sent = sent.lower()
    parser = nltk.parse.ChartParser(grammar)
    for p in parser.parse(nltk.word_tokenize(sent)):
        yield p 

        
for tree in parse("the long runway"): 
    tree.pprint()
    tree[0].draw()


(S (NP (DET the) (ADJP (ADJ long) (N runway))))

NLTK does come with some large grammars, but if constructing your own domain-specific grammar isn't your thing, then you can use the Stanford parser (so long as you're willing to pay for it).


In [61]:
from nltk.parse.stanford import StanfordParser

# change the paths below to point to wherever you unzipped the Stanford Parser download file
stanford_root  = '/Users/benjamin/Development/stanford-parser-full-2014-10-31'
stanford_model = os.path.join(stanford_root, 'stanford-parser-3.5.0-models.jar')
stanford_jar   = os.path.join(stanford_root, 'stanford-parser.jar')

st = StanfordParser(path_to_jar=stanford_jar, path_to_models_jar=stanford_model)
sent = "The man hit the building with the baseball bat."
for tree in st.parse(nltk.wordpunct_tokenize(sent)):
    tree.pprint()
    tree.draw()


(ROOT
  (S
    (NP (DT The) (NN man))
    (VP
      (VBD hit)
      (NP (DT the) (NN building))
      (PP (IN with) (NP (DT the) (NN baseball) (NN bat))))
    (. .)))